Keyword Location in Noisy Document Images

Author

  • Jonathan J. Hull
Abstract

It may be difficult to locate keywords in noisy document images because of degraded OCR performance. A new technique for word image matching has the potential to select those word images in a document that represent potential keywords and to generate improved prototypes for those keywords. No explicit recognition is performed in this process, but better OCR performance will occur on the improved prototypes than would occur on any of the isolated words. The proposed method for keyword selection and recognition is best suited for document indexing in an image-based document retrieval system. This paper presents an algorithm for word image clustering and discusses how it is applied to locate groups of equivalent word images in a document. Improved prototypes are generated for clusters that represent potential keywords. The results of applying the algorithm to an article in the Brown Corpus are given. The keywords chosen by this approach and those chosen from the ASCII text of the article by a conventional keyword selection methodology are compared. The potential for improvement in recognition performance on those keywords is also demonstrated.
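
The abstract does not give the details of the clustering algorithm, so the following is only an illustrative sketch of the general idea: group equivalent word images without recognizing them, then build a cleaner prototype per group. Everything in it is an assumption made for illustration and not the paper's method: the Jaccard overlap measure, the greedy seed-based clustering, the 0.7 threshold, the per-pixel majority vote, and the function names `similarity`, `cluster_word_images`, and `prototype`. It also assumes the word images are already segmented and scaled to the same size.

```python
import numpy as np

def similarity(a, b):
    """Jaccard overlap of black pixels between two same-sized binary images."""
    union = np.logical_or(a, b).sum()
    return 1.0 if union == 0 else np.logical_and(a, b).sum() / union

def cluster_word_images(images, threshold=0.7):
    """Greedy clustering: an image joins the first cluster whose seed
    (first member) is at least `threshold` similar, else starts a new one."""
    clusters = []  # each cluster is a list of indices into `images`
    for i, img in enumerate(images):
        for members in clusters:
            if similarity(images[members[0]], img) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def prototype(images, members):
    """Per-pixel majority vote over the images in one cluster."""
    stack = np.stack([images[i] for i in members]).astype(float)
    return stack.mean(axis=0) > 0.5

def flipped(img, r, c):
    """Return a copy of `img` with one pixel inverted (simulated noise)."""
    out = img.copy()
    out[r, c] = not out[r, c]
    return out

# Two synthetic "words": a horizontal block and a vertical bar.
clean_a = np.zeros((8, 8), dtype=bool)
clean_a[2:6, 1:7] = True
clean_b = np.zeros((8, 8), dtype=bool)
clean_b[:, 3:5] = True

# Noisy instances as they might appear at different places in a document.
images = [
    flipped(clean_a, 2, 1),
    flipped(clean_b, 0, 3),
    flipped(clean_a, 0, 0),
    flipped(clean_a, 5, 6),
    clean_b,
]

clusters = cluster_word_images(images)
proto_a = prototype(images, clusters[0])
```

In this toy setup each noisy copy of the first word has a different damaged pixel, so the majority vote recovers the clean shape even though no single instance is clean. That mirrors the abstract's claim that a prototype built from a cluster can be a better OCR input than any isolated word image.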

Similar resources

Document Image Retrieval Based on Keyword Spotting Using Relevance Feedback

Keyword spotting is a well-known method in document image retrieval. In this method, search in document images is based on a query word image. In this paper, an approach for document image retrieval based on keyword spotting is proposed. In the proposed method, a framework using relevance feedback is presented. Relevance feedback, an interactive and efficient method, is used in this paper to imp...

Image-based keyword recognition in oriental language document images

An algorithm is presented for keyword recognition in Oriental language document images. The objective is to recognize keywords composed of more than one consecutive character in document images where there are no explicit visually defined word boundaries. The technique exploits the redundancy expressed by the difference between the number of possible character strings of a fixed length and the...

Noisy images edge detection: Ant colony optimization algorithm

The edges of an image define the image boundary. When the image is noisy, the edges are not easy to identify. Therefore, a method needs to be developed that can identify edges clearly in a noisy image. Many methods have been proposed earlier that detect edges using filters, transforms and wavelets with ant colony optimization (ACO). We here used ACO for edge detection of noisy ima...

Keyword Spotting on Korean Document Images by Matching the Keyword Image

In this paper, we propose a keyword spotting system for Korean document images and compare the proposed system with an OCR-based document retrieval system. The system is composed of character segmentation, feature extraction for the query keyword, and word-to-word matching. In the character segmentation step, we propose an effective method to resolve the connection between adjacent characters. ...

Automatic Borders Detection of Camera Document Images

When capturing a document using a digital camera, the resulting document image is often framed by a noisy black border or includes noisy text regions from neighbouring pages. In this paper, we present a novel technique for enhancing the document images captured by a digital camera by automatically detecting the document borders and cutting out noisy black borders as well as noisy text regions a...

Publication date: 1993